Abstract

Emerging  supercomputers  strive  to  achieve  an  ever  increasing  performance  metric  at  the  cost  of  excessive  power 
consumption  and  heat  production.  This  expensive  trend  has  prompted  an  increased  interest  in  green  computing.  Green 
computing  emphasizes  the  importance  of  energy  conservation,  minimizing  the  negative  impact  on  the  environment  while 
achieving  maximum  performance  and  minimizing  operating  costs. 

The Condor Cluster, a heterogeneous supercomputer composed of Intel Xeon X5650 processors, Cell Broadband Engine
processors, and NVIDIA general purpose graphical processing units, was engineered by the Air Force Research
Laboratory’s Information Directorate and funded with a DoD Dedicated High Performance Computer Project Investment
(DHPI). The 500 TeraFLOPS Condor was designed to be comparable to the top performing supercomputers while using only a
fraction of the power. The objective of this project was to determine the energy efficiency of Condor, measured as
performance per Watt.

The energy efficiency of Condor was determined using the Green500 test methodology, in particular by measuring power
consumption during maximum performance on the High Performance LINPACK (HPL) Benchmark. The HPL Benchmark
measures computing performance in floating point operations per second while solving a random dense system of linear
equations. A power meter was used to measure the average power consumption of a single node of the system over the
duration of the execution time of the benchmark. Using the consumption measured on a single node and assuming that each
node of the same type draws an equal amount, the energy efficiency of the entire system was calculated. We demonstrate
that Condor achieves an energy efficiency comparable to the top supercomputers on the Green500 List.

1.  Introduction 

The past 20 years have seen a seemingly unstoppable increase in computer performance; we have witnessed a
remarkable 10,000 fold improvement in the peak performance of a high end supercomputer (Feng and Cameron, 2007).
The drive in computer advancements has been strictly performance-based, doing anything necessary to achieve a maximum
number of floating point operations per second (FLOPS). What has not been heavily considered throughout these
advancements, however, is the energy efficiency of the supercomputer. Green computing takes a new view of high
performance computing by considering the energy consumption required to achieve maximum performance goals (Feng et
al., 2008).

The drive for energy efficient computing has been increasingly present in the past several years for a number of
reasons. We continue to observe an immense increase in the peak performance of computers on the TOP500 list over the
years. While this speedup is a great feat, the cost to run these powerful computers is an unavoidable roadblock for
sustainability in terms of total cost of ownership. Consider that the price of electrical energy per megawatt is estimated to
be approximately $1 million per year (Feng et al., 2008). According to the TOP500 list of November 2010, the top
performing supercomputer in the world used 4.04 MW of power, and this system was not the most power hungry on the list
(http://www.top500.org/list/2010/22/100).


Since 2006, there has been an evident drive for energy efficient computing by the U.S. Government. In December of
2006 the U.S. Congress passed Public Law 109-431 “to study and promote the use of energy efficient computer servers in
the United States” (http://energystar.gov). The law emphasizes the need for energy efficiency improvements in government
and commercial servers and data centers. It required the Environmental Protection Agency (EPA) Energy Star Program to
study the areas where energy efficiency improvements could have an impact and to make recommendations for incentive
programs to advance the transition to energy efficient computing. The EPA Energy Star program submitted the “Report to
Congress on Server and Data Center Energy Efficiency” in 2007, in which energy use and cost for data centers in the U.S.
were extensively examined and prospective areas of improvement were addressed (http://www.energystar.gov). In
addition, AMD, Dell, IBM, Sun Microsystems, and VMware formed the Green Grid consortium in 2007. The mission of the
Green Grid is to improve the energy efficiency of data servers and computer ecosystems (Kurp, 2008).

The Green500 List was started in April 2005 to encourage energy efficiency as a first-class design consideration in
emerging  supercomputer  construction  and  to  provide  a  ranking  of  the  top  performing  supercomputers  with  respect  to  an 
energy  efficiency  metric  (www.green500.org).  Similar  to  the  well-known  TOP500  List  that  ranks  high  performance 
computers  based  on  peak  performance,  the  Green500  list  measures  the  peak  performance  of  a  system  running  the  High 
Performance  LINPACK  (HPL)  benchmark  while  also  measuring  the  energy  consumed  to  achieve  such  performance. 
Supercomputers  are  ranked  by  MegaFLOPS  (MFLOPS)  per  Watt,  with  the  minimum  criteria  to  be  accepted  on  the 
Green500  List  being  that  the  supercomputer  must  achieve  HPL  performance  great  enough  to  appear  on  the  most  recent 
TOP500  list. 

With  energy  efficiency  in  mind,  the  Air  Force  Research  Laboratory’s  Information  Directorate  engineered  the  Condor 
Cluster, a heterogeneous supercomputer composed of Intel Xeon X5650 processors, Cell Broadband Engine (Cell BE)
processors,  and  NVIDIA  general  purpose  graphical  processing  units.  This  project  was  funded  with  a  DoD  Dedicated  High 
Performance  Computer  Project  Investment  (DHPI)  and  has  a  theoretical  single  precision  peak  performance  of  500 
TeraFLOPS  (TFLOPS).  This  paper  examines  the  energy  efficiency  of  Condor  using  the  Run  Rules  for  the  Green500  List, 
to  demonstrate  the  total  cost  of  ownership  efficiency  of  this  unique  system  design. 

2.  The  Condor  Cluster 

The Condor Cluster is a heterogeneous supercomputer composed of 94 NVIDIA Tesla C2050s, 62 NVIDIA Tesla
C1060s, 78 Intel Xeon X5650 dual socket processors, and 1716 Sony PlayStation 3s (PS3s), adding up to a total of 69,940
cores and a theoretical peak performance of 500 TFLOPS. There are 84 subcluster head nodes, of which six are gateway
nodes that do not perform computations, while the other 78 compute head nodes are capable of 230 TFLOPS of theoretical
peak processing performance. Each of the 78 compute head nodes is composed of two NVIDIA general purpose graphical
processing units (GPGPUs) and one Intel Xeon X5650 dual socket hexa-core processor (i.e. 12 cores per Xeon). Of the 78
compute head nodes, 47 contain dual NVIDIA Tesla C2050 GPGPUs while 31 contain dual NVIDIA Tesla C1060
GPGPUs. The head nodes are connected to each other via 40 Gbps InfiniBand and 10 Gb Ethernet. Additionally, each
compute head node is connected to a 10GbE/1GbE aggregator that provides communication to a subcluster of 22 PS3s. In
total, the PS3s can achieve a theoretical peak performance of 270 TFLOPS.
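
The component counts above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the device counts and GPU core counts come from this section, while counting seven cores per PS3 (one PPE plus six usable SPEs) is an assumption that happens to reproduce the quoted total.

```python
# Device counts are taken from the text of this section.
GPU_COUNTS = {"Tesla C2050": 94, "Tesla C1060": 62}
GPU_CORES = {"Tesla C2050": 448, "Tesla C1060": 240}  # cores per GPU, quoted in this section
COMPUTE_HEAD_NODES = 78
XEON_CORES_PER_HEAD_NODE = 12  # two hexa-core sockets per head node
PS3_PER_SUBCLUSTER = 22
PS3_CORES = 1 + 6  # assumption: 1 PPE + 6 usable SPEs are counted per PS3

# Each compute head node fronts one subcluster of PS3s.
ps3_total = COMPUTE_HEAD_NODES * PS3_PER_SUBCLUSTER

total_cores = (
    sum(GPU_COUNTS[name] * GPU_CORES[name] for name in GPU_COUNTS)
    + COMPUTE_HEAD_NODES * XEON_CORES_PER_HEAD_NODE
    + ps3_total * PS3_CORES
)

print(ps3_total)    # 1716
print(total_cores)  # 69940
```

Under that assumption the totals agree with the 1716 PS3s and 69,940 cores stated above.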

The NVIDIA GPGPUs in Condor, the Tesla C1060 and the newer model Tesla C2050, share similar architectures but
vary in performance. Both the C1060 and C2050 have the same Tesla architecture based on a scalable processor array
(Lindholm, 2008). The architecture can be broken down into independent processing units called texture/processor clusters
(TPCs). The TPCs are made up of streaming multiprocessors, which perform the calculations for the GPGPU. The
streaming multiprocessors can be broken down further into streaming processors, or cores; these are the main units of the
architecture (Maciol, 2008). The GPGPU communicates with the CPU via the host interface (Lindholm, 2008). The two
models differ in core count: the C1060 has 240 cores while the C2050 has 448 cores (http://www.nvidia.com).

The Intel Xeon processors on each head node are built on the energy efficient Intel Nehalem microarchitecture. This
architecture incorporates several Intel technologies that adjust performance and power usage based on application needs.
When not in use, the processor is capable of drawing a minimal amount of power, and it is also capable of operating above
the rated frequency when necessary. The Condor Cluster is equipped with the six-core Intel Xeon dual socket X5650,
giving a total of 12 cores per head node (http://www.intel.com).

The Sony Toshiba IBM (STI) Cell BE is a nine core heterogeneous processor that consists of one PowerPC Processing
Element (PPE) and eight Synergistic Processing Elements (SPEs). The PPE is based on the open source IBM Power
Architecture processor; it controls and coordinates the SPE tasks and runs the operating system on the processor
(Buttari et al., 2007). The eight SPEs are responsible for the majority of the compute power on the processor
(Gschwind et al., 2007). All code executed by an SPE runs from its 256 KB software controlled local store (Buttari et al.,
2007). Each SPE consists of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC). The MFC transfers
data between the SPE cores as well as between the local store and the system memory (Gschwind et al., 2007). The PPE
and SPEs are connected via the Element Interconnect Bus (EIB), which has a peak bandwidth of 204.8 GB/s (Buttari et
al., 2007).

Condor utilizes the PS3 as a computing platform for access to the Cell BE. The PS3 is equipped with the Cell BE with
minor alterations. Only six of the eight SPEs are available for use in the PS3; one SPE is disabled for yield reasons at the
hardware level and one SPE is reserved solely for the GameOS (Buttari et al., 2007). For use in the Condor Cluster,
CentOS Linux was installed on the PS3s. Additionally, of the total 256 MB of available memory for the Cell Broadband
Engine only 200 MB is accessible to Linux (Buttari et al., 2007).

The  Condor  Cluster  was  engineered  to  increase  the  combat  effectiveness  of  the  Department  of  Defense  through 
technological  advances  supported  by  high  performance  computing.  Next  generation  synthetic  aperture  radar  (SAR)  sensors 
strive  to  provide  surveillance  of  larger  areas  (30  km  diameter)  with  smaller  targets  at  resolutions  close  to  one  foot. 
Applications  such  as  this  demand  real-time  processing  of  over  200  sustained  TFLOPS.  This  surveillance  capability  can  be 
achieved  using  the  SAR  backprojection  algorithm,  a  computationally  intensive  algorithm  that  enables  every  pixel  to  focus 
on  a  different  elevation  to  match  the  contour  of  the  scene.  The  SAR  backprojection  algorithm  has  been  optimized  by  the 
AFRL  Information  Directorate  to  eliminate  nearly  all  double  precision  operations,  favoring  application  on  the  Cell 
Broadband  Engine.  NVIDIA  Tesla  GPGPU  cards  also  have  a  preference  for  single  precision  operations  which  critically 
enhances  the  algorithm  and  consequently  the  number  of  pixels  generated  for  a  30  km  surveillance  circle. 

3.  High  Performance  LINPACK 

The LINPACK benchmark has been the de facto standard for measuring real peak computational performance of
high-performance computers for nearly twenty years. HPL introduced the ability to address scalability in the LINPACK
testing environment, in order to accurately measure the performance of larger, parallel distributed memory systems. Since
1993, HPL has been used to formulate the TOP500 list of the most powerful supercomputers in the world (Dongarra et al.,
2001).

HPL solves a dense system of linear equations using an implementation of LU decomposition. The benchmark includes
the ability to measure the accuracy of the solution, as well as the time required to compute it. In addition, HPL requires the
use of the Message Passing Interface (MPI) for inter-process communication, and an implementation of the
Basic Linear Algebra Subprograms (BLAS) as the linear algebra operations library.
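
To make the three ingredients just described concrete (LU-based solve, accuracy check, timing), here is a minimal single-process sketch in pure Python. It is not HPL: there is no MPI, no BLAS, and no blocking. The operation count 2/3 N^3 + 3/2 N^2 used to convert time into a rate is the figure conventionally quoted for LINPACK-style reporting and should be read as an assumption here; the function names are hypothetical.

```python
import random
import time


def lu_solve(a, b):
    """Solve a*x = b by Gaussian elimination (LU) with partial pivoting."""
    n = len(a)
    a = [row[:] for row in a]  # work on copies so the caller's data is untouched
    b = b[:]
    for k in range(n):
        # Partial pivoting: move the largest entry of column k onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]  # multiplier, i.e. an entry of L
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution with the upper triangular factor U.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def max_residual(a, x, b):
    """Largest absolute entry of a*x - b, a simple accuracy check."""
    return max(abs(sum(aij * xj for aij, xj in zip(row, x)) - bi)
               for row, bi in zip(a, b))


def flop_count(n):
    """Assumed LINPACK-style operation count for an n x n solve."""
    return (2.0 / 3.0) * n**3 + 1.5 * n**2


random.seed(0)
n = 60
a = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
b = [random.uniform(-0.5, 0.5) for _ in range(n)]

start = time.perf_counter()
x = lu_solve(a, b)
elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading

residual = max_residual(a, x, b)
print(f"max residual: {residual:.2e}")
print(f"rate: {flop_count(n) / elapsed / 1e6:.2f} MFLOPS")
```

The reported rate depends entirely on the machine and interpreter, so no expected value is given.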

Because of the general acceptance of HPL as the standard measure of computational performance, Feng et al. chose to
adopt the benchmark to provide the FLOPS metric for scalable system performance as it relates to energy efficiency (Feng
and Cameron, 2007).

3.1. HPL CUDA

Fatica (2009) describes an implementation of HPL for NVIDIA Tesla series graphics processing units (GPUs). The
approach described utilizes the CUBLAS library for the BLAS implementation and requires only minor modifications to
the HPL code. In particular, the implementation utilizes the GPU as a co-processor to the CPU, executing the benchmark
simultaneously on both architectures. Thus, a critical component to achieving maximum performance is to find the
optimum division of processing load between the CPU and GPU.

The only modification to the HPL source code required to enable execution on the Tesla series GPUs was changing
memory allocation calls to cudaMallocHost calls. Subsequent acceleration of the benchmark is achieved by intercepting
calls to DGEMM and DTRSM to utilize the CUBLAS library routines. Fatica’s implementation exploits the independence
of DGEMM operations by overlapping them on the CPU and GPU.
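
One simple way to think about dividing a DGEMM update between the two processors is to give each a share of the columns proportional to its sustained throughput, so that both finish at about the same time. The sketch below illustrates that idea only; it ignores host-to-device transfer time, the function name is hypothetical, and the throughput figures are made-up placeholders rather than measurements from this study or from Fatica (2009).

```python
def split_columns(n_cols, gpu_gflops, cpu_gflops):
    """Split the columns of a matrix update in proportion to device throughput."""
    if gpu_gflops <= 0 or cpu_gflops <= 0:
        raise ValueError("throughputs must be positive")
    gpu_share = gpu_gflops / (gpu_gflops + cpu_gflops)
    gpu_cols = round(n_cols * gpu_share)
    return gpu_cols, n_cols - gpu_cols


# Hypothetical sustained DGEMM rates, for illustration only.
gpu_cols, cpu_cols = split_columns(n_cols=51080, gpu_gflops=300.0, cpu_gflops=100.0)
print(gpu_cols, cpu_cols)  # 38310 12770
```

In practice the best split is found empirically, since transfer overhead and the dependence of DGEMM efficiency on problem size shift the balance point.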

We used CUDA 3.2 and Open MPI 1.4.3 to execute the implementation of HPL on Condor’s GPGPU compute nodes.

3.2.  HPL  Cell  Broadband  Engine  Architecture 

To execute HPL on the Cell BE of the PS3 we used a modified version of the implementation described by Kistler et al.
(2009). The approach described was targeted for the IBM BladeCenter QS22, with two IBM PowerXCell 8i processors.
The PowerXCell 8i is a component of several of the top 10 computers on the Green500 List (http://www.green500.org).
Our implementation has been modified to run on the Cell BE available in the PS3, a variant similar to that of the IBM
BladeCenter QS21. As previously mentioned, the PS3 Cell has only six Synergistic Processing Elements (SPEs) available
for computation, as opposed to the eight SPEs available on the PowerXCell 8i. In addition, the PowerXCell 8i has an
enhanced double precision unit which the PS3 Cell does not have (Kistler et al., 2009).

In contrast to the approach used to implement HPL for the Tesla series GPUs, Kistler et al. implemented the benchmark
through multiple kernel modifications. In particular, the most compute-intensive kernels were modified to exploit the key
architectural characteristics of the PowerXCell 8i. The result was the creation of an HPL acceleration library (Kistler et al.,
2009).

We used the IBM Cell SDK 3.1 and Open MPI 1.4.3 to execute the implementation of HPL on Condor’s PS3 nodes.

4.  Test  Methodology 

To measure the energy efficiency of the Condor Cluster, we followed the Run Rules for submission to the Green500
List. This consists of two basic steps: (1) executing the HPL benchmark capable of achieving peak performance on the
supercomputer and (2) measuring the energy consumption of the supercomputer while running the benchmark. It is
understood that in many cases measuring the total system energy consumption is not feasible. Therefore, the Run Rules
allow for measuring power at a subcomponent (e.g. 1U node, rack, etc.) and then extrapolating this measurement across the
entire system (Run Rules, http://www.green500.org).

Given the uniqueness of the system and its heterogeneous nature, the HPL benchmark could not be run across the
entire system at one time. Additionally, we were not able to measure the power for the entire system at a central location.
Furthermore, the PS3s and the compute head nodes differ significantly in both power draw and computational
performance, the latter particularly as a result of the memory limitations of the PS3 architecture. Therefore, in order to
measure the total power consumed by the system, the supercomputer had to be broken down into three subcomponents: two
PS3s, one NVIDIA C1060 compute node with two NVIDIA C1060s and one Intel Xeon processor, and one NVIDIA
C2050 compute node with two NVIDIA C2050s and one Intel Xeon processor. The benchmark was executed on each of
the three subcomponents and the power for each unit was measured in isolation. The total power for Condor was then
determined using the following equation, where P is power, R_max is the maximum performance achieved by HPL, and N is
the number of units:

\[ P_{total}(R_{max}) = N_{PS3} \cdot P_{PS3}(R_{max,PS3}) + N_{C1060} \cdot P_{C1060}(R_{max,C1060}) + N_{C2050} \cdot P_{C2050}(R_{max,C2050}) \qquad (1) \]

We use an equation similar to (1) to estimate the peak performance R_max of Condor.

Prior to obtaining the results reported below, the HPL benchmark was tuned for each of the three subcomponents.
Tuning HPL to achieve the maximum performance on each subcomponent consisted of varying a selection of parameters
and running several cases to observe the peak FLOPS that could be attained. Documentation on performance tuning and
setting up the input data file for HPL was referenced to assist in this process. One of the most critical parameters is
the matrix size, N. This choice is largely determined by the amount of RAM available to the processor being
tested. The parameters used for our study are listed in Table 1. In addition, we show the memory available to each
subcomponent and the percentage of that memory which the matrix requires when running HPL.

Table 1 - Parameters used for HPL execution

Subcomponent                Problem Size  Block Size  NBMIN  NDIV  Panel factorization  Recursive factorization  Broadcast           RAM     %RAM for NxN
SONY PS3                    5440          128         4      2     R                    L                        Bandwidth Reducing  256 MB  88%
NVIDIA C2050 compute node   51080         512         8      2     L                    R                        Increasing Ring     24 GB   81%
NVIDIA C1060 compute node   51080         256         2      2     R                    L                        Increasing Ring     24 GB   81%
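
The last column of Table 1 can be reproduced from the problem size and the RAM figure. The sketch below assumes 8-byte double precision matrix elements and binary units (1 MB = 2^20 bytes, 1 GB = 2^30 bytes); both are assumptions, chosen because they match the tabulated percentages. The helper name is hypothetical.

```python
def matrix_memory_fraction(n, ram_bytes, bytes_per_element=8):
    """Fraction of RAM occupied by a dense n x n matrix of 8-byte doubles."""
    return n * n * bytes_per_element / ram_bytes


MIB = 2**20  # assumption: the table's "MB" is read as a binary megabyte
GIB = 2**30  # assumption: the table's "GB" is read as a binary gigabyte

ps3_fraction = matrix_memory_fraction(5440, 256 * MIB)
node_fraction = matrix_memory_fraction(51080, 24 * GIB)

print(f"PS3: {ps3_fraction:.0%}")            # PS3: 88%
print(f"compute node: {node_fraction:.0%}")  # compute node: 81%
```

Inverting the same relation gives a quick upper bound on N for a given memory budget before fine tuning the remaining parameters.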

To measure the power consumption of the subcomponents we used the “Watts Up? Pro ES” and followed the Power
Measurement Tutorial by Ge et al. (2006). The Watts Up? Pro ES is a digital power meter with a PC interface. The meter
collects data in one second intervals and stores the results in internal memory until connected to a PC. Upon completion of
a set of tests, the data was downloaded to the PC via USB for recording and processing; the Watts Up? Download
Software was used to collect the data from the device.

The same method was used for capturing the power consumption of each subcomponent. Prior to powering on and
executing HPL, the subcomponent power cord was connected to the power meter, which was in turn connected to the
on-rack power strip. The only difference was for the PS3s, for which we connected two PS3s to a power strip and then
connected the power strip to the meter. Two PS3s were monitored because the HPL implementation used was written for a
QS22 containing two Cell BE processors. Each subcomponent was then powered on and allowed to run for approximately
15 minutes. This allowed the computers to stabilize and gave accurate readings of the average idle power consumption of
each subcomponent. After the stabilization period, we executed the HPL code for each particular subcomponent using the
parameters determined above for achieving maximum performance. Though the Green500 Run Rules state that it is
sufficient to measure power consumption for a minimum of 20% of the HPL runtime, we measured consumption over the
entire run. In addition, the Run Rules state that only two runs are necessary - given a tolerance of less than 1% in power
variation between the two - yet we chose to run these tests 10 and 20 times for the Tesla GPUs and PS3s, respectively.
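
The post-processing of the meter log is straightforward and can be sketched as follows. The sample values are invented for illustration, and reading the 1% tolerance as the relative spread between run averages is an interpretation, not a quotation of the Run Rules; the function names are hypothetical.

```python
from statistics import mean


def average_power(samples_w):
    """Mean power in watts over a run, from one-second meter samples."""
    return mean(samples_w)


def energy_joules(samples_w, interval_s=1.0):
    """Energy over the run, treating each sample as constant for one interval."""
    return sum(samples_w) * interval_s


def within_tolerance(run_averages_w, tolerance=0.01):
    """True if the relative spread between run averages is below the tolerance."""
    lowest, highest = min(run_averages_w), max(run_averages_w)
    return (highest - lowest) / lowest < tolerance


# Invented one-second samples (watts) for an idle period and two benchmark runs.
idle = [188.0, 189.0, 188.5, 188.5]
run_a = [199.0, 200.5, 200.0, 200.5]
run_b = [199.5, 200.5, 200.0, 201.0]

run_averages = [average_power(run_a), average_power(run_b)]
print(average_power(idle))             # 188.5
print(run_averages)                    # [200.0, 200.25]
print(energy_joules(run_a))            # 800.0
print(within_tolerance(run_averages))  # True
```

Averaging over the entire run, as done in this study, avoids having to choose which 20% window of the execution to sample.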

5.  Results 

The results presented below show the energy efficiency performance of Condor at the subcomponent level. We present
the energy consumption of the subcomponent running HPL versus the average idle consumption, and calculate the energy
efficiency in GFLOPS/W using the peak performance achieved on HPL.

Over the course of 20 runs on two PS3 nodes, the average power consumption showed little variation while executing
the HPL benchmark. The average power draw for two PS3s while running the benchmark at peak performance was
observed to be 199.95 W. Compared to the power draw while idle, the additional power required to run the HPL
benchmark at peak performance is very low, as shown in Figure 1. This demonstrates the efficiency of
the PS3 while running computationally intensive problems.

Figure 1 - Power consumption of two PlayStation 3 nodes executing the HPL benchmark.
When idle, the two PS3s consume 188.49 W on average. At peak HPL performance, the
nodes draw an average of 199.95 W, an additional load of approximately 5.73 W per node.

Figure  2  shows  the  results  of  each  run  on  the  PS3  in  terms  of  GigaFLOPS  (GFLOPS)  achieved  and  the  average  power 
consumption  over  the  entire  run.  There  is  an  apparent  relationship  between  the  peak  performance  that  is  achieved  and  the 
power  consumed  by  the  nodes.  In  most  cases,  slightly  higher  power  consumption  was  witnessed  when  the  performance  was 
greater.  A  similar  relationship  was  observed  on  each  of  the  subcomponents  tested. 

The experimental average peak performance of the PS3s was determined to be 10.46 GFLOPS. Thus, at an average
power draw of 199.95 W, the energy efficiency for the PS3s can be calculated as 0.052 GFLOPS/W (52 MFLOPS/W).
Such a rating would be sufficient to place the PS3 nodes in the 20th percentile of the November 2010 Green500 List.

Figure 2 - Performance of the HPL benchmark on two PlayStation 3 nodes. Peak performance
measured as output from HPL, while power consumption is measured as the average over
the duration of the HPL execution.

For comparison, the theoretical peak performance of a single PS3 node is 10.97 GFLOPS. Thus, the peak performance
for two PS3s is 21.9 GFLOPS. Experimentally, we achieved 48% of peak performance for the HPL benchmark. We
expected this low fraction because the PS3 Cell BE is not optimized for double precision computation. On the
other hand, a single PS3 node could achieve 153 GFLOPS in single precision.

Figure 3 - Power consumption of a compute node with dual NVIDIA C2050 GPUs
executing the HPL benchmark. When idle, the node consumes 368.991 W on average. At
peak HPL performance, the node draws an average of 639.59 W, an additional load of
approximately 270.6 W.


The NVIDIA C2050 compute nodes demonstrated higher power consumption, particularly relative to their idle
consumption, but also showed significant improvements in HPL performance. Figure 3 shows that the
average idle power consumption of the NVIDIA C2050 compute nodes is 368 W. When operating at peak performance, we
observed that the nodes consumed 639 W on average. This represents a 73% increase in consumption.

Figure 4 shows the results of each run on the C2050 compute node. The experimental average peak performance for the
C2050 compute node was observed to be 619.5 GFLOPS, which equates to 54% of the theoretical 1.158 TFLOPS for these
nodes (i.e. 128 GFLOPS for the Intel processor and 515 GFLOPS per NVIDIA C2050). The energy efficiency for the
C2050 compute nodes can be calculated as 0.966 GFLOPS/W (966 MFLOPS/W). This efficiency would place the C2050
compute nodes in the 99th percentile of the November 2010 Green500 List.

Figure 4 - Performance of the HPL benchmark on a compute node with dual NVIDIA C2050
GPUs. Peak performance measured as output from HPL, while power consumption is
measured as the average over the duration of the HPL execution.


The NVIDIA C1060 compute nodes demonstrated lower power consumption than the C2050 compute nodes. Figure 5
shows the average consumption of the C1060 compute nodes when idle as compared to the average consumption for each
run of HPL. The average idle power consumption of the NVIDIA C1060 compute nodes is 337 W. When operating at peak
performance, we observed that the nodes consumed 506 W on average. This represents an approximate 50% increase over
idle consumption.

However, unlike the C2050 compute nodes, the C1060 nodes are not fully optimized for double precision
computations. In particular it is the C1060 GPU which does not perform optimally, as the Intel processors are the same as
those on the C2050 nodes. The theoretical peak performance of a C1060 is 933 GFLOPS in single precision but only
78 GFLOPS in double precision (http://www.nvidia.com). By comparison, the C2050
performs at 1.3 TFLOPS in single precision and 515 GFLOPS in double precision (http://www.nvidia.com). As a result,
we observed much lower performance on the C1060 compute nodes, at an average of 118 GFLOPS, or 42% of the peak.

Figure 5 - Power consumption of a compute node with dual NVIDIA C1060 GPUs
executing the HPL benchmark. When idle, the node consumes 336.94 W on average. At
peak HPL performance, the node draws an average of 506.85 W, an additional load of
approximately 169.85 W.
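
The fractions of theoretical peak quoted in this section follow from the per-device figures given in the text. In the sketch below, the C2050 node peak (128 GFLOPS for the Xeon plus 515 GFLOPS per GPU) is stated in the paper; composing the C1060 node peak the same way (128 GFLOPS plus 78 GFLOPS per GPU) is an assumption, although it is consistent with the quoted 42%.

```python
def fraction_of_peak(measured_gflops, peak_gflops):
    """Measured HPL performance as a fraction of theoretical peak."""
    return measured_gflops / peak_gflops


XEON_DP_GFLOPS = 128.0  # double precision peak of the Intel processor, from the text

peak_gflops = {
    "two PS3s": 21.9,
    "C2050 node": XEON_DP_GFLOPS + 2 * 515.0,
    "C1060 node": XEON_DP_GFLOPS + 2 * 78.0,  # assumption: same composition as the C2050 node
}
measured_gflops = {"two PS3s": 10.46, "C2050 node": 619.5, "C1060 node": 118.0}

fractions = {
    name: fraction_of_peak(measured_gflops[name], peak_gflops[name])
    for name in peak_gflops
}
for name, value in fractions.items():
    print(f"{name}: {value:.1%}")
```

The printed values are close to the rounded 48%, 54%, and 42% reported for the three subcomponents.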


Figure 6 shows the results of each run on the C1060 compute node. With an average performance of 118 GFLOPS and
average power consumption of 506 W, the energy efficiency for the C1060 compute nodes can be calculated as 0.233
GFLOPS/W (233 MFLOPS/W). This efficiency would place the C1060 compute nodes in the 75th percentile of the
November 2010 Green500 List.

Figure 6 - Performance of the HPL benchmark on a compute node with dual NVIDIA C1060
GPUs. Peak performance measured as output from HPL, while power consumption is
measured as the average over the duration of the HPL execution.

Our results across all three node classes are shown in Table 2. Using Equation (1) from above, we can calculate the
overall energy efficiency of Condor to be approximately 0.192 GFLOPS/W (192 MFLOPS/W). This rating reflects 41.7
TFLOPS of double precision performance and 217.3 kW of consumed power.

Table 2 - Observed Energy Efficiency of Condor by Subcomponent

Subcomponent                 # of Nodes  Avg Watts Per Node  GFLOPS Per Node  GFLOPS/W
Sony PlayStation 3           1716        99.98               5.23             0.052
NVIDIA C2050 compute node    47          639.59              619.5            0.966
NVIDIA C1060 compute node    31          506.85              118.3            0.233
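
The system-level figures can be reproduced by applying Equation (1), and the analogous sum for performance, to the rows of Table 2. The sketch below is a plain restatement of that arithmetic; the variable names are hypothetical.

```python
# (number of nodes, average watts per node, GFLOPS per node), as listed in Table 2
TABLE_2 = {
    "Sony PlayStation 3": (1716, 99.98, 5.23),
    "NVIDIA C2050 compute node": (47, 639.59, 619.5),
    "NVIDIA C1060 compute node": (31, 506.85, 118.3),
}

# Equation (1): total power is the node count times per-node power, summed over classes.
total_watts = sum(nodes * watts for nodes, watts, _ in TABLE_2.values())
# The same form of sum is used for the aggregate HPL performance.
total_gflops = sum(nodes * gflops for nodes, _, gflops in TABLE_2.values())
efficiency = total_gflops / total_watts

print(f"{total_watts / 1e3:.1f} kW")      # 217.3 kW
print(f"{total_gflops / 1e3:.2f} TFLOPS")  # 41.76 TFLOPS
print(f"{efficiency:.3f} GFLOPS/W")       # 0.192 GFLOPS/W
```

The summed performance of about 41.76 TFLOPS is consistent with the 41.7 TFLOPS quoted above if that figure was truncated rather than rounded.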

While our method for measuring the average power consumption of the nodes is consistent with the methodology
prescribed by the Green500, we realize that isolation of a single node for running HPL and then extrapolating the results
across the entire supercomputer is not consistent with the TOP500 run rules. Parallelization of the benchmark across the
entire supercomputer would introduce degradations on the overall performance, e.g. due to communication and
coordination between the nodes. What we present here can thus be described as an experimentally-rooted theoretical
maximum for the energy efficiency performance of Condor. In practice, we would expect the overall peak performance of
HPL to drop slightly when utilizing the full cluster.

6. Conclusions and Future Work

In a time where the drive for advancing computer systems has been dominated by peak performance at any cost, the
Green500 List challenges emerging developers to examine another key aspect of advanced computing, namely, energy
efficiency. Not only has the cost to operate top-of-the-line supercomputers soared beyond a million dollars per year, but the
excessive power consumption of these emerging supercomputers is negatively impacting the environment, making energy
efficiency a necessity in system design. The Green500 List provides a ranking system where the performance per Watt
metric has not only taken precedence over other metrics, but has been encouraged as a primary consideration in new
designs.

We demonstrated here that the Condor Cluster is capable of achieving energy efficiency performance that would place
in the top 35% of the most recent Green500 list (http://www.green500.org). However, the computational performance is
limited with respect to HPL because the Cell BE and NVIDIA C1060 are not optimized for double precision floating point
operations.

For the majority of the applications run on the Condor Cluster, single precision operation is sufficient; as
such, the design model for the supercomputer was not intended to achieve extraordinary double precision performance. We
consider exploration of mixed-precision approaches to HPL (Kurzak & Dongarra, 2006) or other single precision
benchmarks as an area of future research to demonstrate the efficiency of Condor in its targeted niche of computation.

A key design concept of Condor was to bring the three critical drivers in supercomputer design - peak performance,
price/performance, and performance/Watt - together into a unique and highly sustainable system capable of solving some
of the military’s most critical information processing problems. High performance computing systems are designed to
achieve a peak performance based on their desired applications. Condor is capable of sustaining a peak performance of
200-300 TFLOPS required to perform several important military applications. While engineering a high
performing supercomputer can be very expensive, Condor was built using commodity game consoles and graphics
processors that achieve performance comparable to specialized architectures at a fraction of the cost. With a total cost of
$2.5M, the price/performance ratio far exceeds that of comparable systems. Finally, we have demonstrated the energy
efficiency of Condor to be 0.192 GFLOPS/W. The energy needs of Condor can be translated into sustainability costs on
the order of $0.5M per year. Thus, the Condor Cluster is a powerful, yet highly sustainable asset for the Air Force Research
Laboratory and Department of Defense.